
Shadowserver dynamic config #2372

Merged

aaronkaplan merged 75 commits into develop from shadowserver-dynamic-config on Dec 18, 2023

Conversation

@elsif2 (Collaborator) commented May 28, 2023

Dynamic configuration

Many thanks to Eric Halil for collaboration and testing on this update.

Parsers

  • intelmq.bots.parsers.shadowserver._config:
    • Switch to dynamic configuration to decouple report schema changes from IntelMQ releases. Please see intelmq/bots/parsers/shadowserver/README.md for details. (A brief sketch of the idea is included at the end of this description.)

Collectors

  • intelmq.bots.collectors.shadowserver.collector_reports_api:
    • The 'json' option is no longer supported as the 'csv' option provides better performance. Please see intelmq/bots/parsers/shadowserver/README.md for a sample configuration.
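
To make the parser change above concrete, here is a minimal, hypothetical sketch and not the actual intelmq.bots.parsers.shadowserver._config code: report mappings are read from a schema.json file at runtime, so picking up new or changed report definitions only requires updating that file. The schema.json layout assumed here (per-report entries with feed_name/file_name keys and underscore-prefixed metadata keys) is an illustration, not the documented format.

# Hypothetical sketch only; see intelmq/bots/parsers/shadowserver/README.md for the real details.
import json
import os

SCHEMA_FILE = os.path.join(os.path.dirname(__file__), 'schema.json')

feedname_mapping = {}   # feed name -> report mapping definition
filename_mapping = {}   # file name -> (report mapping definition, feed name)


def reload():
    # Re-read schema.json so new or changed report definitions take effect
    # without a new IntelMQ release.
    feedname_mapping.clear()
    filename_mapping.clear()
    with open(SCHEMA_FILE, encoding='utf-8') as handle:
        schema = json.load(handle)
    for report_name, definition in schema.items():
        if report_name.startswith('_'):  # assumed: reserved metadata keys such as a version marker
            continue
        feedname_mapping[definition.get('feed_name', report_name)] = definition
        filename_mapping[definition.get('file_name', report_name)] = (definition, report_name)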

@sebix (Member) commented May 30, 2023

@elsif2 the license check is failing

@elsif2 (Collaborator, Author) commented May 30, 2023

From the log:

The following files have no copyright and licensing information:
* intelmq/bots/parsers/shadowserver/schema.json.test

License added.

@sebix sebix added this to the 3.3.0 milestone Jul 19, 2023
@kamil-certat kamil-certat self-requested a review July 24, 2023 10:02
@aaronkaplan aaronkaplan self-requested a review July 27, 2023 10:04
@aaronkaplan aaronkaplan self-assigned this Jul 27, 2023
@kamil-certat (Contributor) left a comment

Sorry for the long wait for the review. I have noted a few things that are important to me, mainly from the perspective of an IntelMQ user. I have marked with 🔴 the things that are most critical for me.

Generally, I like separating the schema from IntelMQ releases. Unfortunately, I have doubts about some of the decisions, as I'd like to better control when and whether the update happens. I'd also recommend taking a look at how other bots handle this topic, e.g. the ASN lookup expert: https://github.com/certtools/intelmq/tree/bcccbf501ffa8b64c2ea011d457c3409ad1dd4cd/intelmq/bots/experts/asn_lookup

I have also identified a few points where the code can break, e.g. in our environment.

In addition, I'd recommend adding the regular update to the crontab installed with the native packages: https://github.com/certtools/intelmq/tree/develop/contrib/cron-jobs

If you have any questions or need any help, you can reach me on the Shadowserver MM or by email (but unfortunately I am going on vacation and will be back on 21.08).

Resolved review threads: intelmq/bots/parsers/shadowserver/parser.py (two threads), intelmq/bots/parsers/shadowserver/README.md, intelmq/bots/parsers/shadowserver/_config.py (two threads), intelmq/bots/parsers/shadowserver/update_schema.py
@kamil-certat (Contributor)

In addition, I'd like to ask if you can publish a policy on changes to the schema; the most important point for me is in which cases the classification config (e.g. identifier, taxonomy) can change.

@sebix (Member) commented Jul 27, 2023

I haven't seen a discussion of this architectural change or an IEP yet. Or was it discussed on a non-public list?

@elsif2 elsif2 force-pushed the shadowserver-dynamic-config branch from 77ac732 to 3d6a87e on July 27, 2023 20:31
@elsif2 (Collaborator, Author) commented Jul 27, 2023

In addition, I'd like to ask if you can publish a policy on changes to the schema; the most important point for me is in which cases the classification config (e.g. identifier, taxonomy) can change.

I started a Schema contract section in bots/parsers/shadowserver/README.md to address these concerns.

@aaronkaplan (Member)

I haven't seen a discussion of this architectural change or an IEP yet. Or was it discussed on a non-public list?

@sebix I had a good chat with elsif2 today and we agreed that:

  • I try out the changes and get them to work for myself
  • I document things as I go
  • I cross-check with elsif2 as I go, and we change things iteratively
  • we both start to write down a sort of "contract" / documentation on this. It's paramount for intelmq users and admins to know what the "contract" is: which fields may change dynamically, whether added fields will stay forever, whether only new fields get added, how to deal with new fields, how often updates are needed, what happens when someone does not update, etc. If we write this down and make it clear, it is the basis for discussions and it also helps to improve the concept.

So, TL;DR: don't worry :) we are addressing the concerns.

@aaronkaplan (Member) left a comment

I share a lot of concerns with @kamil-certat here, but my main concern is: is there a way to be as little disruptive as possible? Or let me reformulate: let's think hard about how we can achieve this.

Because most of the time IntelMQ runs unattended, automatically, at night.
People don't notice in the logs if something dynamic (like a config update) happened.
Nor will they debug it.

We agreed in a chat today that

  • I would try out the change
  • We iterate over it and improve it
  • We document it
  • We document a type of "contract" between users/-admins and the dynamic update service
  • We need to have an opt-in/opt-out process for this (a sketch of one possible hook follows after this comment).

I want to reiterate that I think it is very, very beneficial if IntelMQ users can easily get the latest and best Shadowserver parsers. We just need to make sure it gets done right.
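
One concrete opt-in/opt-out hook already appears in the README excerpt reviewed further below: the parser skips the startup schema download when INTELMQ_SKIP_INTERNET is set. A minimal sketch of such a check follows; the function name and log message are assumptions, not code from this PR.

import logging
import os

logger = logging.getLogger(__name__)


def maybe_update_schema() -> bool:
    # Respect the operator's opt-out before attempting any download.
    if os.environ.get('INTELMQ_SKIP_INTERNET'):
        logger.info('INTELMQ_SKIP_INTERNET is set; keeping the locally installed schema.json.')
        return False
    # ... otherwise download and activate the latest schema revision here
    return True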

CHANGELOG.md Outdated

#### Parsers
- `intelmq.bots.parsers.shadowserver._config`:
- Reset detected `feedname` at shutdown to re-detect the feedname on reloads (PR#2361 by @elsif2, fixes #2360).
- Switch to dynamic configuration to decouple report schema changes from IntelMQ releases. Please see intelmq/bots/parsers/shadowserver/README.md for details. (PR#2372)
Member:

I think we should also add a section in the official IntelMQ docs for the Shadowserver feeds, since they are so crucial.

Member:

The README is not the right place to document the bot for users.
User documentation is in docs/user/bots.rst (https://intelmq.readthedocs.io/en/latest/user/bots.html#shadowserver) and intelmq/etc/feeds.yml (https://intelmq.readthedocs.io/en/latest/user/feeds.html). For upgrade steps, use NEWS.md.
Please move the information from the README there.

For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision.
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set.

For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory.
Member:

Also, what happens when people don't update in air-gapped environments? Is the parsing fail-safe then? Like how HTML ignores fields and element attributes it does not understand?

'classification.identifier': 'spam-url',
},
}
functions = {
Member:

I won't check this huge diff, but I assume it's valid, since the automatically generated content gets removed and moved to the dynamic config generation.

Collaborator (Author):

The behavior is unchanged from the existing release. Errors are logged for new report types. Fields added to existing reports are ignored.

}


def reload():
Member:

  1. And have a mechanism to test the new config automatically. If it fails, the admin needs to know. And the developers need to know.
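
A hypothetical sketch of such a "test before activating" step: download the schema to a temporary file, sanity-check it, and only then replace the live schema.json, so a broken download never becomes the active configuration. The URL is a placeholder and the checks are deliberately minimal; none of this is code from the PR.

import json
import os
import tempfile
import urllib.request

SCHEMA_URL = 'https://example.org/shadowserver-schema.json'  # placeholder, not the real endpoint
SCHEMA_FILE = 'schema.json'


def update_schema():
    with urllib.request.urlopen(SCHEMA_URL, timeout=30) as response:
        data = response.read()
    schema = json.loads(data)  # raises on invalid JSON, so the old file stays active
    if not isinstance(schema, dict) or not schema:
        raise ValueError('Downloaded schema is empty or has an unexpected structure')
    # Write atomically so a half-written file can never be picked up by the parser.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(SCHEMA_FILE)))
    with os.fdopen(fd, 'wb') as handle:
        handle.write(data)
    os.replace(tmp_path, SCHEMA_FILE)

Raising on a failed check, instead of silently continuing, is what lets a cron job or the bot's log surface the problem to the admin.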

For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision.
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set.

For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory.
Member:

Agree with sebix here: this would be a good initial step which helps it to work implicitly.

destination_queues:
_default: [shadowserver-parser-queue]
file_format: csv
api_key: "$API_KEY_received_from_the_shadowserver_foundation"
Member:

Is it also possible to give a specific query (as in the Shadowserver API) here? Like asn:12345?

Collaborator (Author):

Is it also possible to give a specific query (as in the Shadowserver API) here? Like asn:12345?

No. Some organizations do have separate report sets that can be handled independently using the reports option.

@sebix (Member) commented Aug 2, 2023

I haven't seen a discussion of this architectural change or an IEP yet. Or was it discussed on a non-public list?

* we both start to write down a sort of "contract" / documentation on this. It's paramount for intelmq users and admins to know what the "contract" is: which fields may change dynamically, whether added fields will stay forever, whether only new fields get added, how to deal with new fields, how often updates are needed, what happens when someone does not update, etc. If we write this down and make it clear, it is the basis for discussions and it also helps to improve the concept.

This is a good step forward!

So, TL;DR: don't worry :) we are addressing the concerns.

I'm not worried personally, as I'm not directly affected, but many users are. I want to contribute my experience to get the change into operation smoothly.

Needless to say, the proposed approach has its advantages.

Here are some thoughts:

  • The delegation of the mapping to a third party outside of the IntelMQ software also deprives the IntelMQ developers & users of the possibility of reviewing the field mappings before they become active.
  • Until now, only collectors fetch the Shadowserver data via API or from local sources like IMAP, file, ticket system, etc. IntelMQ doesn't need to be connected to the internet to parse Shadowserver data. With this change merged, IntelMQ gains an additional requirement on the operation. Operators need to consider that and eventually adapt their installations or environments.
  • Another issue is that it will take more work to parse older or historical data. Do you remember that a few years ago, we parsed Shadowserver data covering a few years of data for statistical purposes? We had the older Shadowserver feed mappings at hand, thus no issues parsing older data.

@aaronkaplan (Member)

Let me try to address some topics mentioned.

  • The delegation of the mapping to a third party outside of the IntelMQ software also deprives the IntelMQ developers & users of the possibility of reviewing the field mappings before they become active.

Nope, since you can always fetch the newest mappings manually and review them. So I don't see a problem here.
But yes, we could and should try out the "air-gapped" mode and see if it's actually feasible before merging. Agreed on that.

  • Until now, only collectors fetch the Shadowserver data via API or from local sources like IMAP, file, ticket system, etc. IntelMQ doesn't need to be connected to the internet to parse Shadowserver data. With this change merged, IntelMQ gains an additional requirement on the operation. Operators need to consider that and eventually adapt their installations or environments.

Again, see above: it will also work in air-gapped mode, and pulling the latest mappings will be similar to doing an IntelMQ update manually.

  • Another issue is that it will take more work to parse older or historical data. Do you remember that a few years ago, we parsed Shadowserver data covering a few years of data for statistical purposes? We had the older Shadowserver feed mappings at hand, thus no issues parsing older data.

This is a very relevant point. I think it needs to be in the contract that historic fields don't get remapped, changed, or deleted.

@kamil-certat (Contributor)

Thanks for the changes, @elsif2! I see most of my technical concerns have been addressed :)

About the general discussion:

re: delegation of schema outside IMQ

I see this as a trade-off problem, but I think Shadowserver has a very good reason to delegate it outside, as their reports are developing much more quickly than IntelMQ. In addition, I see the possibility of automated upgrades of this schema as a big help for admins: even with very quick adoption in IMQ packages, you still need to upgrade the whole IntelMQ instead of just the schema. Yes, it has some cons, but I think the pros are stronger in this case.

re: air-gapped systems

This is why I'd suggest keeping a default, regularly updated (e.g. once a month) schema in the repo. Then your air-gapped system can still work after installation; without additional configuration it would just have a delay in adopting schema changes.

re: historical schema

This is a good point. But I'd just suggest keeping the history of schemas, in two ways:

  • Do not delete old schemas; instead, create a history folder and store all downloaded schemas there, named after the schema version (a sketch follows after this comment).
  • For cases where you don't have the schema: @elsif2, would it be possible for Shadowserver to publish a public catalog of all historical schemas?

As I understand it, the schema is versioned by time, so it should be easy to choose the right one if you need to reparse old reports, assuming that Shadowserver ensures that the creation date of a schema means the schema is in use starting that day or the next (please clarify this in the contract).
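
A minimal sketch of the suggested history folder, assuming the downloaded schema carries a version marker; the _meta/version key names and file paths used here are hypothetical.

import json
import shutil
from pathlib import Path

SCHEMA_FILE = Path('schema.json')
HISTORY_DIR = Path('schema-history')


def archive_current_schema() -> Path:
    # Keep a copy of the active schema, named after its version marker,
    # so old reports can later be re-parsed with the mapping that was valid at the time.
    HISTORY_DIR.mkdir(exist_ok=True)
    with SCHEMA_FILE.open(encoding='utf-8') as handle:
        schema = json.load(handle)
    version = schema.get('_meta', {}).get('version', 'unknown')  # assumed key names
    target = HISTORY_DIR / 'schema-{}.json'.format(version)
    if not target.exists():
        shutil.copy2(SCHEMA_FILE, target)
    return target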

Resolved review thread: CHANGELOG.md

__config.feedname_mapping.clear()
__config.filename_mapping.clear()
for schema_file in [__config.schema_file, __config.schema_base]:
Contributor:

Could you explain a little why schema_base, which looks to be a test schema, is loaded every time? If the only usage is for testing, I'd suggest loading it just during tests by specifying the bot's parameters.

If it has more usages, I'd suggest loading the base first. Otherwise, the base can override the updated schema.

Collaborator (Author):

Added a test_mode to switch between schema files.

Resolved review threads: intelmq/bots/parsers/shadowserver/_config.py (three threads), intelmq/bots/parsers/shadowserver/parser.py
For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision.
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set.

For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory.
Contributor:

I'm not sure which is better (a starting schema vs. downloading when missing), but I can accept either of them.

@elsif2 (Collaborator, Author) commented Aug 24, 2023

The complete revision history of the schema files will be available at https://github.com/The-Shadowserver-Foundation/report_schema.

@elsif2 elsif2 force-pushed the shadowserver-dynamic-config branch from d435ea3 to 516d534 on November 4, 2023 00:29
@elsif2 elsif2 force-pushed the shadowserver-dynamic-config branch from 516d534 to 61c756d on November 4, 2023 00:44
@kamil-certat (Contributor)

Thank you!

@sebix sebix dismissed aaronkaplan’s stale review November 28, 2023 09:24

Already reviewed some parts. Major parts are reviewed by Kamil and Aaron

@aaronkaplan (Member)

Looks like my requests have been met.
Testing again now.

@sebix (Member) commented Nov 29, 2023

I wanted to cancel the pending review for myself requested by Aaron, but it looks like I instead canceled one of Aaron's reviews 🤔

Anyway, I won't review this PR again, as I already reviewed some parts of it (core, docs). The major and more relevant parts have already been reviewed by Kamil and Aaron. I don't want to duplicate their efforts. I shared my thoughts on the change in general already above (#2372 (comment))

@aaronkaplan aaronkaplan requested review from aaronkaplan and removed request for sebix November 29, 2023 19:37
@aaronkaplan (Member)

@sebix all good, I am still looking at it. It's a mega, mega, mega commit. We need to do smaller ones next time to make the review process faster. Most of us have other duties / daytime jobs, etc.

@kamil-certat (Contributor)

This branch is currently in production use on our instance; so far I do not see any issues.

@DigitalTrustCenter (Contributor)

We would really like to see this feature merged into IntelMQ. We know all too well the pain of manually merging _config.py whenever new reports are added. The changes made to host(name) on the 18th of November showed the problem of requiring a different config from one day to the next. The dynamic config would fix those kinds of issues as well.

We do however see a problem. We run a custom config with a couple of changes compared to the upstream config:

  • We changed the classification.identifier of scan_http_vulnerable from accessible-http to vulnerable-http. accessible-http is also used by scan_http, which is a completely different report.
  • event_honeypot_brute_force, event_honeypot_darknet, event_sinkhole_http and event_sinkhole map fields like tag to classification.identifier. We would like a constant value for classification.identifier, so we moved it to the constant_fields section with the values honeypot-brute-force, darknet, sinkhole-http and sinkhole.
  • sandbox_url and sandbox_conn (unnecessarily, in our opinion) mix source and destination fields. We changed destination.fqdn and destination.url to source.fqdn and source.url.

We would not be surprised if other organizations have their own changes that they made to the config.
A solution would be to allow us to configure a path to a JSON patch document (RFC 6902) in the config. This patch would be applied to the schema JSON after a new version is downloaded in the update process (a sketch of the idea follows below).
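
A hypothetical sketch of that local-patch step, using the third-party jsonpatch package. The file names, the idea of a configurable patch path, and the JSON pointer used in the example operation are assumptions, not part of this PR or of the published schema format.

import json

import jsonpatch  # third-party package: pip install jsonpatch

SCHEMA_FILE = 'schema.json'                   # the downloaded Shadowserver schema
LOCAL_PATCH_FILE = 'schema-local-patch.json'  # RFC 6902 operations maintained locally


def apply_local_patch():
    # Apply a locally maintained RFC 6902 patch to the freshly downloaded schema.
    with open(SCHEMA_FILE, encoding='utf-8') as handle:
        schema = json.load(handle)
    with open(LOCAL_PATCH_FILE, encoding='utf-8') as handle:
        operations = json.load(handle)
    patched = jsonpatch.JsonPatch(operations).apply(schema)
    with open(SCHEMA_FILE, 'w', encoding='utf-8') as handle:
        json.dump(patched, handle, indent=2)


# Example patch document for the first change mentioned above (the pointer assumes the
# schema nests constant fields per report, which may not match the real layout):
# [
#   {"op": "replace",
#    "path": "/scan_http_vulnerable/constant_fields/classification.identifier",
#    "value": "vulnerable-http"}
# ]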

@kamil-certat (Contributor) commented Dec 6, 2023

Hey, we indeed have a couple of custom changes (e.g. filters by tags, our own identifiers, etc.), but we use a sieve bot for that - and as long as the feed.code is stable, our bot handles that independently. I would recommend this approach instead of modifying `_config.py` or patching the schema.

@DigitalTrustCenter (Contributor)

Thanks for the pointer. We are now looking into improving our setup using the sieve bot. The only thing it cannot solve is the change of the fields for sandbox_url (you cannot add a new field with a value from another field). Maybe those fields should be changed in the upstream schema itself. However, that is not relevant for this PR.

@elsif2 (Collaborator, Author) commented Dec 12, 2023

The change occurred between 3.0.2 and 3.1.0. I will correct this in the schema.

3.0.2

sandbox_url = {
    'required_fields': [
        ('time.source', 'timestamp', add_UTC_to_timestamp),
        ('source.ip', 'ip'),
    ],
    'optional_fields': [
        ('source.asn', 'asn', invalidate_zero),
        ('source.geolocation.cc', 'geo'),
        ('malware.hash.md5', 'md5hash'),
        ('source.url', 'url'),
        ('user_agent', 'user_agent', validate_to_none),
        ('source.fqdn', 'host', validate_fqdn),
        ('extra.', 'method', validate_to_none),
    ],
    'constant_fields': {
        'classification.taxonomy': 'malicious-code',
        'classification.type': 'malware-distribution',
        'classification.identifier': 'sandbox-url',
    },
}

3.1.0

sandbox_url = {
    'required_fields': [
        ('time.source', 'timestamp', add_UTC_to_timestamp),
        ('source.ip', 'ip', validate_ip),
    ],
    'optional_fields': [
        ('destination.fqdn', 'host', validate_fqdn),
        ('extra.http_request_method', 'method', validate_to_none),
        ('source.asn', 'asn', invalidate_zero),
        ('source.geolocation.cc', 'geo'),
        ('malware.hash.md5', 'md5', validate_to_none),
        ('destination.url', 'url', convert_http_host_and_url, True),
        ('user_agent', 'user_agent', validate_to_none),
    ],
    'constant_fields': {
        'classification.taxonomy': 'malicious-code',
        'classification.type': 'malware-distribution',
        'classification.identifier': 'sandbox-url',
    },
}

sandbox_conn = {
    'required_fields': [
        ('time.source', 'timestamp', add_UTC_to_timestamp),
        ('source.ip', 'ip', validate_ip),
        ('source.port', 'port', convert_int),
    ],
    'optional_fields': [
        ('destination.fqdn', 'host', validate_fqdn),
        ('source.asn', 'asn', invalidate_zero),
        ('source.geolocation.cc', 'geo'),
        ('malware.hash.md5', 'md5', validate_to_none),
        ('protocol.transport', 'protocol'),
        ('extra.', 'bytes_in', validate_to_none),
        ('extra.', 'bytes_out', validate_to_none),
    ],
    'constant_fields': {
        'classification.taxonomy': 'malicious-code',
        'classification.type': 'malware-distribution',
        'classification.identifier': 'sandbox-conn',
    },
}

@sebix (Member) commented Dec 13, 2023

Thanks for the pointer. We are now looking into improving our setup using the sieve bot. The only thing it cannot solve is the change of the fields for sandbox_url (you cannot add a new field with a value from another field).

That is true. As a workaround, you could use the Jinja2 Template Expert or the Modify Expert.

@aaronkaplan (Member) left a comment

This has been possibly the longest-running PR in IntelMQ.
But it's also the most complex one I have ever seen.
I think @elsif2 managed to stun us all.
Checking and testing this rigorously is pretty much out of the question.
But what I tested and ran on my setup looks good.

Thanks a ton for this important PR and sorry that it took months. Godspeed, shadowserver!

Resolved review thread: docs/user/bots.md
@aaronkaplan (Member)

And all checks have passed :) merging in...

@aaronkaplan aaronkaplan merged commit 4743ba9 into develop Dec 18, 2023
31 checks passed
@sebix sebix deleted the shadowserver-dynamic-config branch December 19, 2023 07:01
Comment on lines +159 to +165
- `intelmq.bots.collectors.shadowserver.collector_reports_api`:
- The 'json' option is no longer supported as the 'csv' option provides better performance.

#### Parsers
- `intelmq.bots.parsers.shadowserver._config`:
- Reset detected `feedname` at shutdown to re-detect the feedname on reloads (PR#2361 by @elsif2, fixes #2360).
- `intelmq.bots.parsers.shadowserver._config`:
- Switch to dynamic configuration to decouple report schema changes from IntelMQ releases.
Member:

The changelog entry is in the wrong section (3.2.0 instead of 3.2.2).

Member:

Can be fixed with #2447.


5 participants